Abstract

We present a new algorithm for computing the maximum entropy probability 
distribution satisfying a set of constraints. Unlike previous approaches, our 
method is integrated with the planning of data collection and tabulation. 
We show how adding constraints and performing the associated additional 
tabulations can substantially speed up computation by replacing the usual 
iterative techniques with a straightforward computation. We note, however,
that the constraints added may contain significantly more variables than 
any of the original constraints so there may not be enough data to collect 
meaningful statistics. These extra constraints are shown to correspond to 
the intermediate tables in Cheeseman's method. Furthermore, we prove that 
acyclic hypergraphs and decomposable models are equivalent, and discuss 
the similarities and differences between our algorithm and Spiegelhalter's 
algorithm. Finally, we compare our work to Kim and Pearl's work on singly- 
connected networks. 

Portions of this thesis are joint work with Ronald Rivest. 

Keywords: maximum entropy, uncertain reasoning, probabilistic techniques, 
acyclic hypergraphs, constrained optimization.
Chapter 1

Introduction 



Many applications require reasoning with incomplete information. For ex- 
ample, one may wish to develop expert systems that can answer questions 
based on an incomplete model of the world. Having an incomplete model 
means that some questions may have more than one answer consistent with 
the model. How can a system choose reasonable answers to these questions? 

Many solutions to this problem have been proposed. While probability 
theory is the most widely used formalism for representing uncertainty, ad hoc
approximations to probability have been used in practice. The problem
with pure probabilistic approaches is that their computational complexity 
seems prohibitive. One way to reduce the complexity of this problem is 
to make the strong assumption of conditional independence. However, this 
assumption is typically not valid and generates inaccurate results. A proba- 
bilistic method which shows promise for solving some of the problems with 
uncertain reasoning is the maximum entropy approach. 

This thesis discusses efficient techniques, based on the principle of max- 
imum entropy, for answering questions when given an incomplete model. 
The organization is as follows. In the remainder of this chapter, we for- 
mally define the inference problem and justify the use of the maximum 
entropy principle. In chapter 2, previously known methods for calculating
the maximum entropy distribution are discussed. Then, in chapter 3,
we present a new technique which makes maximum entropy computations 
easier by adding extra constraints, and we discuss an alternative approach 
which leads to another efficient algorithm for calculating the maximum en- 
tropy probability distribution. In chapter 4, we compare our new technique 
to some of the other techniques discussed in the thesis. Finally, chapter 5 
contains conclusions and discussion of open problems. 

1.1 Formal Problem Definition 

In this section, we formally define the inference problem which this thesis
addresses. We begin by defining some notation. Let $V = \{A, B, C, \ldots\}$ be
a finite set of binary-valued variables, or attributes. (The generalization
to finite-valued variables is straightforward.) Consider the event space
$\Omega_V$ defined to be the set of all mappings from $V$ to $\{0, 1\}$. We call
such mappings assignments since they assign a value to each variable in $V$.
It is easy to see that $|\Omega_V| = 2^{|V|}$. If $E \subseteq V$, then $\Omega_V$
is isomorphic to $\Omega_E \times \Omega_{V-E}$; we identify assignments in
$\Omega_E$ with subsets of $\Omega_V$ in the natural manner.

We are interested in probability distributions defined on $\Omega_V$. We use
the following convention throughout this paper. If $E \subseteq V$, we write
$P(E)$ to denote the probability of an element of $\Omega_E \subseteq \Omega_V$.
In other words, we specify only the variables involved in the assignments
and not their values. For example,

$$P(V) = P(A)P(B)P(C)\cdots \qquad (1.1)$$

represents $2^{|V|}$ equations, stating that the variables are independent. (We
do not assume equation (1.1).) By convention, all assignments in an equation
must be consistent. We also write $P(A)$ instead of $P(\{A\})$, $P(AB)$ instead
of $P(\{A, B\})$, and so on.

We use a similar convention for summations: $\sum_E$ stands for a summation
over all assignments in $\Omega_E$, when $E \subseteq V$. Using these conventions,
we see that $Y \subseteq E \subseteq V$ implies that

$$P(Y) = \sum_{E-Y} P(E) \qquad (1.2)$$

For example, if $E = \{A, B, C, D\}$ and $Y = \{A, B\}$ then $P(Y) = \sum_{CD} P(E)$.
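
As an illustration of equation (1.2), the following minimal Python sketch stores a joint distribution as a dictionary from assignments to probabilities and sums out the variables not of interest. The function name `marginal`, the dictionary layout, and the uniform example table are assumptions made for this sketch only; they do not come from the thesis.

```python
from itertools import product

def marginal(joint, variables, keep):
    """Sum a joint table over every variable not in `keep` (equation 1.2)."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

# Hypothetical joint over E = {A, B, C, D}: the uniform distribution.
E = ["A", "B", "C", "D"]
joint = {w: 1.0 / 16 for w in product((0, 1), repeat=len(E))}

# P(AB) is obtained by summing over all assignments to C and D.
p_ab = marginal(joint, E, ["A", "B"])
```

Each entry of `p_ab` is the sum of the four joint entries that agree with it on $A$ and $B$.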

For conditional probabilities we use similar notation. For $Y \subseteq E \subseteq V$
the probability of $E$ given $Y$ is written as $P(E|Y)$ and defined to be

$$P(E|Y) = P(E - Y|Y) = \frac{P(E)}{P(Y)}. \qquad (1.3)$$

We say that $X$ and $Y$ are conditionally independent given $S$ if

$$P(X \cup Y|S) = P(X|S)P(Y|S) \qquad (1.4)$$

where $X, Y, S \subseteq V$.

We are interested in probability distributions on $\Omega_V$ satisfying a set of
constraints. We assume that the constraints are supplied in the form of
joint marginal probabilities. Let $E_1, \ldots, E_m$ be distinct but not
necessarily disjoint subsets of $V$. Let us suppose that for each $i$ we are
given the $2^{|E_i|}$ constraint values $\{P(E_i)\}$. Furthermore, we assume that
these values are consistent. By consistent we mean that there exists at least
one probability distribution on $\Omega_V$ which satisfies the constraints. Note
that equation (1.2) states that a constraint on the values $P(Y)$ is implied
by a constraint on the values $P(E)$ when $Y \subseteq E$. A common way of
ensuring that the constraints are consistent is to derive the constraints by
computing the observed marginal probabilities from a common set of data.¹

Many techniques require that constraints are given in the form of
conditional probabilities. Here, each element of $E_1, \ldots, E_m$ has a
distinguished subset $e_i$, and the input is the $2^{|E_i|}$ constraint values
$\{P(E_i - e_i|e_i)\}$ for each $i$. This approach is frequently used when
obtaining information from experts because it is often easier for experts to
give information in terms of conditional probabilities. In general, we find
that joint marginal distributions are easier to handle. Since we plan to
obtain the constraint values from raw data, we are free to use joint marginal
distributions. By doing so, we avoid both the possibility of inconsistent
data, and the difficulty in transforming conditional distributions to joint
marginal distributions.

1 Using "experts" to provide subjective probability estimates is a well-known way of
deriving a set of inconsistent constraints [36].

In general, there may be many probability distributions satisfying the 
constraints. There are two problems which must be addressed. First, assum- 
ing that there are many probability distributions satisfying the constraints, 
which one should be chosen? And second, how can one efficiently calculate 
the desired distribution? Most of our attention shall be given to the sec- 
ond of these two problems, but we briefly address the first in the following 
section. 

1.2 The Maximum Entropy Principle 

In this section, we formally define the maximum entropy principle and sum- 
marize arguments justifying its use. When faced with an underconstrained 
problem, a reasonable way to get a unique answer is to apply the principle 
of maximum entropy. The entropy function, $H$, is defined as follows:

$$H(P) = -\sum_V P(V)\log(P(V)). \qquad (1.5)$$

The maximum entropy probability distribution, $P^*$, is the unique
distribution which maximizes $H$ while satisfying the supplied constraints.
Motivation for this choice is given by Jaynes, Rissanen, Shore and Johnson,
and Tikochinsky, Tishby and Levine [17,18,19,27,30,35]. We summarize their
arguments in the following two sections.
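
To make equation (1.5) concrete, here is a small Python sketch that evaluates the entropy of a distribution stored as a dictionary. The natural logarithm and the two example tables are assumptions of this sketch; the text does not fix a base for the logarithm, and changing the base only rescales $H$.

```python
import math

def entropy(dist):
    """H(P) = -sum_V P(V) log P(V); terms with P(V) = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)

# Two hypothetical distributions over the four assignments of {A, B}.
uniform = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
skewed = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)
```

With no constraints beyond normalization, the uniform table attains the largest entropy, so `h_uniform` exceeds `h_skewed`.
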
1.2.1 Informal Justification

In this section, we provide intuitive arguments as to why the probability 
distribution that maximizes the entropy function is the appropriate choice 
to make when dealing with an underconstrained problem. 

Informally, the maximum entropy principle says that when one makes 
inferences based on incomplete information, one should draw them from 
the probability distribution that has the maximum uncertainty permitted 
by the data. That is, the maximum entropy distribution is the unique 
distribution which is maximally noncommittal with regard to missing infor- 
mation. There are many intuitive reasons why probability distributions of 
high entropy are favored over others [17,18]. Some such reasons are that 
the distributions of higher entropy assume less, are more probable, and are 
smoother. From an information theoretic viewpoint, the possible distributions
are concentrated near the one of maximum entropy. That is, given incomplete
information, the maximum entropy distribution is not only realized in the
greatest number of ways, but for a large number $N$ of independent trials the
overwhelming majority of all possible distributions compatible with the data
have entropy
very close to the maximum. Thus to choose an estimate other than the 
one that maximizes entropy would amount to ignoring the vast majority of 
all the possibilities allowed by the data, and concentrating on a small and 
unrepresentative subclass of them. 

1.2.2 Formal Justification 

While the informal justification of the previous section provides a convinc- 
ing argument for applying the maximum entropy principle, one does not 
have to rely on these intuitive arguments. In this section, we discuss some 
formal arguments for choosing the probability distribution which maximizes 
entropy. 

We would like any general method of inference to obtain identical solutions
if the same input is processed in different yet equivalent ways. Since
the maximum entropy principle is designed as a general method of infer- 
ence, one would like to show that it maintains consistency. Johnson and 
Shore [30,19] prove the even stronger result that the principle of maximum 
entropy is the only method of inference which satisfies a set of consistency 
properties. They prove that the principle of maximum entropy is correct in 
the following sense: maximizing any function but entropy will lead to in- 
consistency unless that function and entropy have identical maxima. They 
give the following set of consistency axioms: 

1. Uniqueness: The result should be unique. 

2. Invariance: The choice of coordinate system should not matter. 

3. System Independence: It should not matter whether one accounts 
for independent information about independent systems separately in 
terms of different densities or together in terms of a joint density. 

4. Subset Independence: It should not matter whether one treats an in- 
dependent subset of system states in terms of a separate conditional 
density or in terms of the full system density. 

Then Johnson and Shore prove that if the input is in the form of constraints 
on expected values, the unique distribution which satisfies the constraints 
and the consistency axioms is the one obtained by maximizing the entropy 
function. 

Some very interesting results which are also based on satisfying a consis- 
tency property are provided by Tikochinsky, Tishby, and Levine [35]. They 
assume the event space is $\Omega_V$ and the constraints are $m$ expected values
$E(i)$ where $i = 1, \ldots, m$. The goal is to find a probability distribution
$P$ which fulfills

$$E(i) = \sum_V P(V)A(i, V) \qquad (1.6)$$
These constraints are more general than the constraints discussed in section
1.1. If we restricted $A(i, V)$ to be either 0, meaning element $V$ should not
be included, or 1, meaning element $V$ should be included, and $E(i)$ to be the
sum of the probabilities of the included elements, then we get constraints
which are joint marginal probability distributions.

Now assume that assignment $v$ occurred $N_v$ times in $N$ independent
trials. Let $\mathbf{N} = (N_1, \ldots, N_{|\Omega_V|})$ be any particular distribution and define

N \ V / 

In other words, E'(i) is the expected number of occurrences of event i if N 
independent (reproducible) trials were performed. Now there are two ways 
to compute $P_N$:

1. Apply some algorithm (a) to the constraints $E$ to get $P$ and then

2. First apply equation (1.7) to get $E'$ and then apply algorithm (a) to
$E'$ to get $P_N$.

They define algorithm (a) to be consistent if these two routes yield the 
same results. And finally, they prove that algorithm (a) is consistent if and 
only if algorithm (a) is the maximum entropy procedure, thus justifying the 
maximum entropy principle. 

Finally, one other way in which the maximum entropy principle can be 
justified is to justify a more general technique. Rissanen's work [27] provides 
this type of justification. His work deals with the principle of minimum 
description length (MDL). The MDL principle is based on minimizing the 
total number of binary digits required to rewrite the observed data. He 
justifies the MDL principle by arguing that it correctly expresses one's initial 
ignorance. He shows that when the parameters determine the data the MDL 
principle degenerates to the principle of maximum entropy. Therefore, one
could argue that in such a case maximizing entropy is desirable. 

1.3 Log-Linear Representations

Provided that the maximum entropy distribution is the one of choice, we 
would like an algorithm that calculates it efficiently. The remaining chapters 
of this thesis will address this issue. One problem that is immediately
apparent is that the space required to store a probability distribution is
exponential in $|V|$. In this section, we discuss an efficient way to store
the maximum entropy probability distribution.

One advantage of the maximum entropy distribution, $P^*$, is that it has
a simple representation. For each $\omega$ in $\Omega_{E_i}$, there is a
non-negative real parameter $\alpha_{E_i}(\omega)$ (i.e., one parameter set per
constraint set and $2^{|E_i|}$ parameters in the parameter set for $E_i$); these
parameters determine $P^*$ as follows. Let us write $\alpha_i(\omega)$ instead of
$\alpha_{E_i}(\omega)$ for brevity, and omit the argument $\omega$ when it can be
deduced from context. Now we may simply write

$$P^*(V) = \alpha_1 \alpha_2 \cdots \alpha_m. \qquad (1.8)$$

Each element of $\Omega_V$ is assigned a probability which is the product of the
appropriate $\alpha$'s where each $\alpha$ determines its argument from the
assignment to $V$. This is known as a log-linear representation. For example,
suppose we have the variables $A, B, C$ and constraint sets $E_1 = \{A, B\}$ and
$E_2 = \{B, C\}$. Then we will have the variable sets $\alpha_{AB}$ and
$\alpha_{BC}$, where the corresponding variables are
$\alpha_{AB}(00), \ldots, \alpha_{AB}(11)$ and $\alpha_{BC}(00), \ldots, \alpha_{BC}(11)$.
The maximum entropy distribution is given by the log-linear model

$$P^*(V) = \alpha_{AB}\,\alpha_{BC}.$$

If $P^*_{ABC}(010)$ (i.e., the probability that $A = 0$, $B = 1$, and $C = 0$) is
desired, it can be calculated by

$$P^*(010) = \alpha_{AB}(01)\,\alpha_{BC}(10).$$
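
The lookup described above can be sketched in a few lines of Python. The numerical values of the two parameter tables are hypothetical, chosen only so that the products sum to one; the names `alpha_ab`, `alpha_bc`, and `p_star` are likewise assumptions of this sketch.

```python
from itertools import product

# Hypothetical parameter sets, one table per constraint set.
alpha_ab = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
alpha_bc = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.75}

def p_star(a, b, c):
    """Equation (1.8): each alpha picks its argument out of the assignment."""
    return alpha_ab[(a, b)] * alpha_bc[(b, c)]

# P*(010) = alpha_AB(01) * alpha_BC(10)
p_010 = p_star(0, 1, 0)

# The full table is never stored; it can be regenerated entry by entry.
total = sum(p_star(*w) for w in product((0, 1), repeat=3))
```

Only the parameter tables are kept in memory, so storage grows with the sizes of the constraint sets rather than with $2^{|V|}$.
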
Chapter 2

Previous Work 



In the previous chapter, we argued that the desired probability distribu- 
tion is the one which maximizes the entropy, and discussed a way to ef- 
ficiently store the maximum entropy distribution. The essential question 
which remains is: How can we efficiently calculate the maximum entropy 
distribution? The next two chapters will address this question. 

In principle, the problem of computing the maximum entropy distribution
is solved by the classical Lagrange multipliers method. That is, we want
to maximize equation (1.5) subject to constraints of the form

$$\sum_V P(V)A(i, V) = b(i) \quad \text{where } i = 1, \ldots, m. \qquad (2.1)$$

These constraints are of the general form introduced in section 1.2.2 when
discussing Tikochinsky, Tishby and Levine's work. Now, applying the
Lagrange multiplier method we find that the maximum entropy distribution is
given by

$$P^*(V) = \frac{1}{Z}\exp\left(-\sum_{i=1}^{m} \lambda_i A(i, V)\right) \qquad (2.2)$$

where

$$Z = \sum_V \exp\left(-\sum_{i=1}^{m} \lambda_i A(i, V)\right) \qquad (2.3)$$

and the $\lambda_i$'s are determined by the constraints. (We note that the
$\alpha$'s defined in section 1.3 come from the Lagrange multiplier technique;
in fact $\alpha_i = e^{-\lambda_i}$.) Thus, calculating $P^*$ is just the problem of
solving a set of $m$ non-linear equations to obtain $\lambda_1, \ldots, \lambda_m$
(or equivalently $\alpha_1, \ldots, \alpha_m$). Mathematically this problem has
been solved with such techniques as Newton-Raphson, modified Newton-Raphson,
and the conjugate gradient method [33]. So, in principle, we know how to
calculate the maximum entropy distribution, but these techniques are
iterative and require exponential time. Our goal is to find a more efficient
algorithm to obtain the maximum entropy distribution.

In section 2.1, we discuss iterative methods to solve the set of non-linear
equations for the $\alpha$'s. While in the general case it is necessary to solve
for the $\alpha$'s to obtain the maximum entropy distribution, if appropriate
restrictions are made then the maximum entropy distribution can be obtained
without iteration. In section 2.2, some non-iterative techniques are introduced.

2.1 Iterative Maximum Entropy Methods 

Most existing methods for calculating the maximum entropy distribution 
are iterative. They typically begin with a representation of the uniform dis- 
tribution and converge towards a representation of the maximum entropy 
distribution. Each step adjusts the representation so that a given constraint 
is satisfied. To enforce a constraint $P(E_i)$, all of the elementary
probabilities $P(V)$ relevant to that constraint are multiplied by a common factor.
Because constraints are dependent, adjusting the representation to satisfy 
one constraint may cause a previously satisfied constraint to no longer hold. 
Thus, one must iterate repeatedly through the constraints until the desired 
accuracy is reached. (We note that the implicit constraint — that the prob- 
abilities sum to one — must usually be explicitly considered.) Examples of 
this type of algorithm are discussed in [5,6,13,16,20,22]. 

Representing the probability distribution explicitly as a table of $2^{|V|}$
values is usually impractical. For this problem, it is most convenient to
store only $\alpha_1, \alpha_2, \ldots, \alpha_m$; this is a representation as
compact as the input data, which represents the current probability
distribution implicitly via equation (1.8). To represent the uniform
distribution, every $\alpha$ is set to 1, except for the $\alpha$ corresponding to
the requirement that entries of the probability distribution must sum to 1,
which is set to $2^{-|V|}$. To determine if a constraint is satisfied, one must
sum the appropriate elements of the probability distribution. Any particular
element can be computed using equation (1.8). If the constraint is not
satisfied, the relevant $\alpha$ is multiplied by the ratio of the desired sum to
the computed sum. Thus, in originally calculating the $\alpha$'s and later in
evaluating queries it is necessary to evaluate a sum of terms, where each
term is a product of $\alpha$'s. This sum is difficult to compute since it may
involve an exponential number of terms.
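
The update rule just described can be sketched as follows. This is a minimal illustration, not the implementation of any of the cited papers: the function name `fit_alphas`, the data layout, the fixed number of sweeps, and the example constraint tables are all assumptions. Sums are evaluated by brute force over all $2^{|V|}$ assignments, which is exactly the cost the text warns about. Because every constraint table in the example sums to one, the normalizing parameter is left at its initial value.

```python
from itertools import product

def fit_alphas(variables, constraints, sweeps=50):
    """constraints: list of (tuple of variable names, {assignment: target})."""
    pos = {v: k for k, v in enumerate(variables)}
    space = list(product((0, 1), repeat=len(variables)))
    alpha0 = 2.0 ** -len(variables)  # normalizing parameter of the uniform start
    alphas = [{w: 1.0 for w in target} for _, target in constraints]

    def restrict(world, evars):
        return tuple(world[pos[v]] for v in evars)

    def prob(world):
        # Equation (1.8): product of the relevant parameters.
        p = alpha0
        for (evars, _), alpha in zip(constraints, alphas):
            p *= alpha[restrict(world, evars)]
        return p

    for _ in range(sweeps):
        for i, (evars, target) in enumerate(constraints):
            computed = {w: 0.0 for w in target}
            for world in space:
                computed[restrict(world, evars)] += prob(world)
            for w in target:
                if computed[w] > 0.0:
                    # Multiply by the ratio of desired sum to computed sum.
                    alphas[i][w] *= target[w] / computed[w]
    return prob

# Hypothetical consistent constraints on {A, B} and {B, C}.
p_ab = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_bc = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.3}
prob = fit_alphas(["A", "B", "C"], [(("A", "B"), p_ab), (("B", "C"), p_bc)])
```

The returned `prob` function evaluates the fitted distribution at any assignment `(a, b, c)`; its marginals on $\{A, B\}$ and $\{B, C\}$ reproduce the supplied tables.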

Cheeseman [6] proposes a clever technique for rewriting such sums in
order to evaluate them more efficiently. For example

$$\alpha \sum_{A \ldots F} \alpha_{AB}\,\alpha_{ACD}\,\alpha_{DE}\,\alpha_{AEF}$$

is rewritten as follows. First, $\sum_{A \ldots F}$ is broken into six sums, each
over one variable. Arbitrarily choosing the variable ordering $CDFEAB$, we obtain

$$\alpha \sum_B \sum_A \sum_E \sum_F \sum_D \sum_C \alpha_{AB}\,\alpha_{ACD}\,\alpha_{DE}\,\alpha_{AEF}.$$

Now each $\alpha$ is moved left as far as possible (it stops when reaching a sum
over a variable on which it depends). The above sum then becomes

$$\alpha \sum_B \sum_A \alpha_{AB} \sum_E \sum_F \alpha_{AEF} \sum_D \alpha_{DE} \sum_C \alpha_{ACD}.$$

The sums are evaluated from right to left. The result of each sum is an
intermediate table containing the value of the sum evaluated so far as a
function of variables further to the left which have been referenced. For
example, after evaluating the innermost summation a table, $T_{AD}$, is kept
containing $\sum_C \alpha_{ACD}$ for $\Omega_{AD}$. Then after evaluating the next
sum, a table, $T_{AE}$, is kept containing $\sum_D \alpha_{DE} T_{AD}$ for
$\Omega_{AE}$. We continue in this manner until all the summations are
evaluated. For this example, the 193 multiplications required by the naive
method are reduced to only 21 multiplications. As we shall see, in some cases
the intermediate table is most efficiently represented as two or more smaller
tables. The variable ordering must be chosen carefully in order to take full
advantage of this technique. A poor choice of variable ordering can yield a
sum which is not much better than explicitly considering all $2^{|V|}$ terms; a
good choice may dramatically reduce the work required. While picking a good
variable ordering is important, the success of this technique depends greatly
on the interconnectedness of the constraints. If the constraints are highly
connected, no ordering can significantly reduce the complexity of evaluating
the summation.
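
The rewriting can be illustrated with a short Python sketch that sums out one variable at a time and keeps the intermediate tables. It is a generic variable-elimination routine written for this illustration, not Cheeseman's own code: the names `brute_force`, `eliminate`, and `table`, the factor layout, and the numeric table entries are assumptions, and the constant leading parameter is omitted because it only scales the result.

```python
from itertools import product

def table(fvars, base):
    """Hypothetical parameter table with distinct positive entries."""
    worlds = product((0, 1), repeat=len(fvars))
    return {w: base + 0.1 * k for k, w in enumerate(worlds)}

def brute_force(factors, variables):
    """Naive sum over all 2^|V| assignments of the product of the factors."""
    total = 0.0
    for world in product((0, 1), repeat=len(variables)):
        env = dict(zip(variables, world))
        term = 1.0
        for fvars, t in factors:
            term *= t[tuple(env[v] for v in fvars)]
        total += term
    return total

def eliminate(factors, order):
    """Sum out the variables in `order` (innermost sum first)."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        scope = sorted({v for fvars, _ in touching for v in fvars} - {var})
        new = {}
        for world in product((0, 1), repeat=len(scope)):
            env = dict(zip(scope, world))
            s = 0.0
            for x in (0, 1):
                env[var] = x
                term = 1.0
                for fvars, t in touching:
                    term *= t[tuple(env[v] for v in fvars)]
                s += term
            new[world] = s  # intermediate table over the remaining variables
        factors = rest + [(tuple(scope), new)]
    result = 1.0
    for _, t in factors:  # only tables over the empty scope remain
        result *= t[()]
    return result

factors = [
    (("A", "B"), table("AB", 1.0)),
    (("A", "C", "D"), table("ACD", 0.5)),
    (("D", "E"), table("DE", 2.0)),
    (("A", "E", "F"), table("AEF", 1.5)),
]
naive = brute_force(factors, list("ABCDEF"))
fast = eliminate(factors, list("CDFEAB"))
```

Both routes give the same value; the second never builds a table over more than two variables for this ordering.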

Some alternative approaches to the standard iterative schemes have been 
proposed. One of the more interesting proposals is due to Geman [14,23]. 
Instead of considering one constraint at a time, this algorithm uses stochas- 
tic relaxation to simultaneously adjust the probability distribution to meet 
all of the constraints. In particular, a convex function, whose minimum 
gives the maximum entropy distribution, is calculated. Then a technique to 
approximate the gradient and a gradient descent algorithm are used to find 
this minimum. 

An approach which comes immediately from the Lagrange multiplier
technique is discussed by Agmon, Alhassid, and Levine [1,2]. First they
calculate the "potential function"

$$F(\boldsymbol{\lambda}) = \log(Z(\boldsymbol{\lambda})) + \sum_{i=1}^{m} \lambda_i b(i) \qquad (2.4)$$

where $Z$ is as defined in equation (2.3). They show that $F$ is strictly
convex, and has a unique global minimum at the point $\boldsymbol{\lambda}^*$ which
solves $\nabla F(\boldsymbol{\lambda}^*) = 0$. Finally, they use the modified
Newton-Raphson procedure to find the global minimum.

Unlike most approaches which are based on the Lagrange multiplier tech- 
nique, Csiszar [9] uses I-divergence geometry to prove that a generalized ver- 
sion of the standard iterative technique converges. His proof is more general 
than those which are derived from the Lagrange multiplier technique. 

2.2 Non-Iterative Maximum Entropy Techniques 

The iterative techniques of the previous section are applicable in most sit- 
uations, but computationally they are rather inefficient. We want a model 
that is powerful enough to handle real-world situations, yet simple enough 
for the maximum entropy distribution to be calculated efficiently. In this 
section, we begin by discussing some non-iterative approaches which use 
conditional independence instead of the principle of maximum entropy to 
obtain a unique result. Then we introduce several non-iterative maximum 
entropy algorithms which will be discussed in more detail in chapter 3. 

Chow and Liu [8] consider the class of product approximations in which 
only second-order distributions are used. If there is a product approximation 
such that for some ordering of the variables $x_1, \ldots, x_n$, each $x_i$ depends on
at most one variable from the set $\{x_1, \ldots, x_{i-1}\}$, then this approximation
forms a dependence tree. They discuss how to build the best dependence 
tree when supplied with the complete probability distribution. They also 
present a method to construct an optimal dependence tree from samples. 

Similar to their work is the work of Kim and Pearl [21]. They construct
a Bayesian network where the nodes represent variables and directed links
represent direct dependencies; all direct influences on a node come from its
parents. We will use the following notation for stating their formula: $S_x$
is the set of the immediate predecessors (parents) of node $x$ in the network,
$T_x$ is the set of all predecessors of node $x$ in the network, and
$\mathcal{R}$ is the set of roots (sources). All conditional probabilities of
the form $P(x|S_x)$ for all $x \notin \mathcal{R}$ and $P(x)$ for all
$x \in \mathcal{R}$, along with the independence assumptions that
$P(x|S_x) = P(x|T_x)$, suffice to define the following unique probability
distribution:

$$P^*(V) = \left(\prod_{x \in \mathcal{R}} P(x)\right)\left(\prod_{x \notin \mathcal{R}} P(x|S_x)\right). \qquad (2.5)$$


They define a network to be singly-connected if there is at most one
undirected path between any pair of nodes. One of the most interesting results of
this work is that the propagation of new evidence through a singly-connected 
network can be accomplished by a network of parallel processors in time 
proportional to the longest path in the network. Pearl [26] addresses the 
problem of propagating new evidence through multiply-connected networks. 

Dalkey [10] has shown that if a Bayesian network is a tree (i.e. all 
nodes have at most one parent), equation (2.5) gives the maximum entropy 
distribution. So for a Bayesian network which is a tree, a non-iterative 
technique exists for calculating the maximum entropy distribution when the 
input is all conditional probabilities of the form $P(x|S_x)$ for all $x \notin \mathcal{R}$ and
$P(x)$ for all $x \in \mathcal{R}$.
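
For a tree-structured network, equation (2.5) amounts to one table lookup per node. The sketch below assumes a hypothetical three-node tree with root $B$ and children $A$ and $C$; the function name `tree_joint`, the table layout, and all numerical values are assumptions made for illustration.

```python
from itertools import product

def tree_joint(root_tables, cond_tables, parent, assignment):
    """Equation (2.5) for a tree: each non-root node has exactly one parent."""
    p = 1.0
    for x, t in root_tables.items():
        p *= t[assignment[x]]
    for x, t in cond_tables.items():
        # t is keyed by (value of x, value of its parent).
        p *= t[(assignment[x], assignment[parent[x]])]
    return p

# Hypothetical inputs: P(B), P(A|B), and P(C|B).
root_tables = {"B": {0: 0.6, 1: 0.4}}
cond_tables = {
    "A": {(0, 0): 0.4 / 0.6, (1, 0): 0.2 / 0.6, (0, 1): 0.25, (1, 1): 0.75},
    "C": {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.25, (1, 1): 0.75},
}
parent = {"A": "B", "C": "B"}

p_010 = tree_joint(root_tables, cond_tables, parent, {"A": 0, "B": 1, "C": 0})
total = sum(
    tree_joint(root_tables, cond_tables, parent, dict(zip("ABC", w)))
    for w in product((0, 1), repeat=3)
)
```

No iteration is involved: the probability of any assignment is a product of supplied values.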

Let's now return to the maximum entropy approach. Even with Cheese- 
man's summation technique, the general iterative algorithm of section 2.1 
has a very high (exponential) computational cost since many iterations 
through the constraints are still required before the distribution converges. 
The non-iterative techniques discussed up to this point are efficient, but 
they assume conditional independence which is rarely present. We want a 
non-iterative maximum entropy algorithm in order to reduce the time to 
compute the probability distribution. If we are willing to put restrictions 
on the supplied constraints, then this goal can be achieved. 

Malvestuto [24] provides sufficient conditions for the existence of a non- 
iterative technique when the constraints are joint marginal probabilities. 
The details are discussed in section 3.1. Our work is based on the model 
which Malvestuto introduced. We show that by enlarging the set of
constraints to be considered before the data is gathered and tabulated, the
difficulties of computing the maximum entropy distribution are substantially
alleviated. That is, with the enlarged set of constraints, the computation 
of the maximum entropy distribution is non-iterative. The details of our 
method are discussed in section 3.2. Darroch, Lauritzen and Speed [11] 
also give sufficient conditions for the existence of a non-iterative technique 
when the constraints are joint probabilities. The details of their work are 
discussed in section 3.3. Edwards and Kreiner [12] and Wermuth and Lau- 
ritzen [37] extend this work. Some work which is similar to ours is that of 
Spiegelhalter [31,32]. He shows how to take any Bayesian network which 
has no directed cycles and convert it to an undirected representation which 
meets the restrictions of Darroch, Lauritzen and Speed. 



Chapter 3 

Efficient Maximum Entropy 
Algorithms 



In this chapter, we cover two non-iterative maximum entropy algorithms.
In section 3.1, we introduce a way to model constraints which are joint
marginal probability distributions as a hypergraph, and present a
non-iterative formula for the maximum entropy distribution for hypergraphs
which are acyclic. In practice, this formula is not very useful since typically
constraint sets do not form acyclic hypergraphs. In section 3.2, we propose
a new maximum entropy algorithm which is based on the observation that
a hypergraph can be made acyclic by adding hyperedges. In other words,
a maximum entropy computation can actually be simplified by adding
constraints. In section 3.3, we discuss an alternative way to graphically model
the constraints and give a non-iterative formula for the maximum entropy
distribution for models which are decomposable. Finally, in section 3.4, we
describe Spiegelhalter's algorithm for estimating a probability distribution.

3.1 Acyclic Hypergraphs 

Our approach is based on the work of Malvestuto [24], who derived sufficient 
conditions for writing marginals of the maximum entropy distribution as a 
product of easily calculated probabilities. In this section we introduce his
work. We begin by describing how to model a set of variables and associated 
constraints as a hypergraph. It is interesting to note that the work on acyclic 
hypergraphs first appeared in the database literature [3,4,25]. The variables 
in our problem replace the attributes of the database, and the constraint 
sets replace the relations. If the database schema is acyclic, many problems 
can be simplified. 

A hypergraph is like an ordinary undirected graph, except that each
edge is an arbitrary subset of the vertices, instead of just a subset of size
two. We define the hypergraph $H = (V, \mathcal{E})$ to contain a vertex for each
variable, and a hyperedge for each constraint. For example the hyperedge
$\{ABC\}$ corresponds to the constraint set $E_i = \{A, B, C\}$. We say that
hyperedge $X$ subsumes hyperedge $Y$ if $Y \subseteq X$. It is important to observe
that the constraints on a sub-hypergraph induced by restricting attention
to a subset of the vertices can be inferred from the original hypergraph
constraints using equation (1.2).

We define the graph $C(H)$ of a hypergraph $H$ to be the graph whose
vertices are those of $H$ and whose edges are the vertex pairs $\{v, w\}$ such that
$v$ and $w$ are in a common hyperedge of $H$. A hypergraph $H$ is conformal
if every clique of $C(H)$ is contained in a hyperedge of $H$. A graph $G$ is
triangulated if for every cycle of length greater than three, there is an edge
of $G$ joining two non-consecutive vertices in the cycle. Such an edge is called
a chord of the cycle; hence triangulated graphs are sometimes called chordal
graphs. A hypergraph $H$ is acyclic if $H$ is conformal and $C(H)$ is chordal.

An equivalent definition is that a hypergraph is acyclic if repeatedly ap- 
plying the following reduction steps results in the empty hypergraph (con- 
taining no edges and no vertices): 

1. Delete any vertices which belong to only one hyperedge. 

2. Delete any hyperedges which are subsumed by another hyperedge. 

Graham's algorithm is the procedure of applying reduction steps 1 and 2
until either the empty set is reached, or neither can be applied [15]. 
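
The two reduction steps can be sketched directly in Python. The function name `is_acyclic` and the representation of hyperedges as strings of single-letter variable names are assumptions of this sketch. A single leftover hyperedge with no vertices is treated as the empty hypergraph, and of two identical hyperedges the later one is deleted as subsumed.

```python
def is_acyclic(hyperedges):
    """Apply reduction steps 1 and 2 until neither changes anything."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed:
        changed = False
        # Step 1: delete vertices which belong to only one hyperedge.
        for e in edges:
            lonely = {v for v in e if sum(v in f for f in edges) == 1}
            if lonely:
                e -= lonely
                changed = True
        # Step 2: delete hyperedges which are subsumed by another hyperedge.
        kept = []
        for i, e in enumerate(edges):
            subsumed = any(
                e < f or (e == f and j < i)
                for j, f in enumerate(edges)
                if j != i
            )
            if subsumed:
                changed = True
            else:
                kept.append(e)
        edges = kept
    # Acyclic exactly when nothing but vertex-free hyperedges remains.
    return all(len(e) == 0 for e in edges)

chain = is_acyclic(["AB", "BC"])
triangle = is_acyclic(["AB", "BC", "AC"])
covered = is_acyclic(["AB", "BC", "AC", "ABC"])
```

Here `chain` is `True`, `triangle` is `False`, and `covered` is `True`: adding the hyperedge $\{A, B, C\}$ to the triangle yields a hypergraph that reduces to the empty one.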

Before proceeding, we shall define some notation regarding the above
reduction procedure. Let $\mathcal{E}^{(0)} = \{E_1^{(0)}, \ldots, E_m^{(0)}\}$, where
$E_i^{(0)}$ is the $i$th hyperedge of $H$. Let $Y_i^{(k)}$ be the set of variables
of $E_i^{(k)}$ which appear in at least one hyperedge other than $E_i^{(k)}$.
Finally let $\mathcal{E}^{(k+1)}$ be the result of applying reduction step (1) and
then (2) to $\mathcal{E}^{(k)}$. If $H$ is acyclic then there exists an $l$ such
that $\mathcal{E}^{(l+1)} = \emptyset$.

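For acyclic hypergraphs, Malvestuto [24] gave the following formula for
the maximum entropy distribution, $P^*(V)$:

$$P^*(V) = \left(\prod_{k=0}^{l-1} \prod_{i} \frac{P(E_i^{(k)})}{P(Y_i^{(k)})}\right)\left(\prod_{i} P(E_i^{(l)})\right) \qquad (3.1)$$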

Note that no $\alpha$'s are needed; the formula depends only on probabilities in the
original input data (constraints). This formula is an immediate extension of
the following theorem due to Malvestuto [24].

Theorem 1 Given the constraints $E_1, \ldots, E_m$, the maximum entropy
distribution is given by the following.

$$P^*(V) = \frac{P(E_1) \cdots P(E_m)}{P(Y_1) \cdots P(Y_m)}\,P^*(Y)$$

where $Y_i$ is the set of variables of $E_i$ which appear in at least one
hyperedge other than $E_i$, $Y = Y_1 \cup \cdots \cup Y_m$, and $P^*(Y)$ is the
maximum entropy distribution for the constraints $Y_1, \ldots, Y_m$.

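Proof: From the joint marginal constraints we have the following 

$$P(E_i) = \sum_{V - E_i} \alpha_1 \cdots \alpha_m \qquad (3.2)$$

Similarly we have, 

$$P(Y_i) = \sum_{Z_i} \sum_{V - E_i} \alpha_1 \cdots \alpha_m = \Big(\sum_{Z_i} \alpha_i\Big)\Big(\sum_{V - E_i} \prod_{j \ne i} \alpha_j\Big) \qquad (3.3)$$

Let $\beta_i = \sum_{Z_i} \alpha_i$. Combining equations (3.2) and (3.3) from above gives: 

$$\alpha_i = \frac{P(E_i)}{P(Y_i)} \, \beta_i \qquad (3.4)$$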

Now writing $P^*(V)$ in its product form we get 

$$P^*(V) = \alpha_1 \cdots \alpha_m = \frac{P(E_1) \cdots P(E_m)}{P(Y_1) \cdots P(Y_m)} \, \beta_1 \cdots \beta_m \qquad (3.5)$$

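We want to show that $\psi(Y) = \beta_1 \cdots \beta_m$ is $P^*(Y)$, the maximum entropy 
distribution for the constraints $P(Y_i)$. To do this, it suffices to prove that 
the joint marginal constraints hold. Summing equation (3.5) over $V - E_i$ gives 

$$P^*(E_i) = \sum_{Y - Y_i} \sum_{Z - Z_i} \frac{P(E_1) \cdots P(E_m)}{P(Y_1) \cdots P(Y_m)} \, \psi(Y) \qquad (3.6)$$

where $Z = Z_1 \cup \cdots \cup Z_m$, so that $V = Y \cup Z$. Now, since the $Z_i$'s are 
disjoint, 

$$\sum_{Z - Z_i} \prod_{j \ne i} P(E_j) = \prod_{j \ne i} \sum_{Z_j} P(E_j) = \prod_{j \ne i} P(Y_j) \qquad (3.7)$$

Substituting equation (3.7) into equation (3.6) gives: 

$$P^*(E_i) = \frac{P(E_i)}{P(Y_i)} \sum_{Y - Y_i} \psi(Y)$$

However since $P^*(E_i) = P(E_i)$ we get $P(Y_i) = \sum_{Y - Y_i} \psi(Y)$, so $\psi(Y)$ satisfies 
the constraints $P(Y_i)$. ■ 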

3.2 A New Maximum Entropy Method 

In this section, we propose a new procedure for calculating the maximum 
entropy distribution. While Malvestuto's work provides a non-iterative maximum 
entropy algorithm for acyclic hypergraphs, in practice, this technique is not 
very useful since typically constraint sets do not form acyclic hypergraphs. 
While one could make a hypergraph acyclic by removing some hyperedges, 
this approach would lead to inaccurate results. In this section, we propose 
a new maximum entropy algorithm which is based on the observation that 
a hypergraph can be made acyclic by adding hyperedges. In other words, 
maximum entropy computations can actually be simplified by adding con- 
straints. The main advantage of our procedure is that it avoids the iteration 
previously required by providing a non-iterative formula for the desired an- 
swer. The major disadvantage is that the method cannot ordinarily be ap- 
plied if the data is already tabulated and the constraints already derived; the 
method requires that one "plan ahead" and tabulate additional constraints 
when processing the data. 

3.2.1 Description 

We begin by describing our algorithm. Equation (3.1) allows one to avoid 
iteration when calculating the maximum entropy distribution for schemas 
having acyclic hypergraphs. What should one do for cyclic hypergraphs? 
Our method is based on the observation that a hypergraph can always be 
made acyclic by adding hyperedges. This claim is trivial to prove, since at 
worst a hypergraph can be made acyclic by adding the hyperedge containing 
all vertices. For example, the hypergraph: 

$(V, \mathcal{E})$ = ({ABCDEF}, {{AB}, {ACD}, {DE}, {AEF}}) 

becomes acyclic when the hyperedge {ADE} is added (see figures 3.1 and 
3.2). 

{AB, ACD, DE, AEF} 
{AD, DE, AE} 

Figure 3.1: The hypergraph consisting of the hyperedges {AB}, {ACD}, {DE}, 
and {AEF} is cyclic as shown by the reduction above. (Elements of $Y_i$ are 
underlined.) 

{AB, ACD, ADE, AEF} 
{ADE} 

Figure 3.2: The hypergraph consisting of the hyperedges {AB}, {ACD}, {ADE}, 
and {AEF} is acyclic as shown by the reduction above. (Elements of $Y_i$ are 
underlined.) 

Thus, by adding additional constraints (edges) the maximum entropy cal- 
culation can be simplified so that no iteration is required. Here is a summary 
of how our method works: 

1. We begin with a set of variables (attributes) and constraint sets deemed 
to be of interest. (Cheeseman [7] discusses a learning program which 
uses the raw data to find a set of significant constraints. Edwards 
and Kreiner [12] also discuss how to choose a good set of constraints.) 
Here a "constraint set" is a set of variables; the intent is that dur- 
ing data-gathering there will be one joint marginal distribution table 
created for each constraint set, and the observed events will be tabu- 
lated once in each table according to the values of the attributes in the 
constraint set. For example, if {A, B,C} is a constraint set of three 
binary-valued attributes, then there will be a table of size 8 used to 
categorize the data with respect to these three attributes. This results 
in 8 constraints on the maximum-entropy distribution desired, one for 
each of the eight observed probabilities P(ABC). 

2. Construct the corresponding hypergraph $H = (V, \mathcal{E})$, where there is 
one vertex for each variable and one hyperedge corresponding to each 
constraint group. 

3. Perform Graham's algorithm on $H$, and let $H'$ denote the resulting 
hypergraph. If $H'$ is the empty hypergraph, then $H$ is acyclic, and the 
following step is skipped. 

4. Find a set $X$ of additional hyperedges (constraint groups) which can be 
added to $H'$ to make it acyclic. Note that any original edges subsumed 
by edges in $X$ are eliminated. These additional hyperedges should 
be chosen to minimize the space required to store the joint marginal 
distributions. 

5. Collect data for the expanded set $\mathcal{E} \cup X$ of constraints.¹ 

6. Apply equation (3.1) to calculate individual elements of the maximum 
entropy distribution. 
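
The tabulation described in steps 1 and 5 can be illustrated with a short Python sketch. The representation is an assumption made for the example only: each raw observation is a dictionary from attribute name to value, a constraint set is a tuple of attribute names, and `tabulate` is a hypothetical helper name.

```python
from collections import Counter


def tabulate(records, constraint_set):
    """Return the observed joint marginal probabilities for one constraint set.

    `records` is a list of dicts mapping attribute name -> value and
    `constraint_set` is a tuple of attribute names (both hypothetical
    representations chosen for this sketch). Cells that were never
    observed are simply absent from the result.
    """
    counts = Counter(tuple(r[a] for a in constraint_set) for r in records)
    total = len(records)
    return {cell: n / total for cell, n in counts.items()}


# Four made-up observations of three binary attributes.
records = [
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 0, "C": 0},
    {"A": 1, "B": 1, "C": 1},
]
table_abc = tabulate(records, ("A", "B", "C"))  # at most 8 cells
table_ab = tabulate(records, ("A", "B"))        # at most 4 cells
```

Because a table for an added hyperedge can only be filled from the raw observations, the same `records` must still be on hand when step 5 is carried out.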



1 Our method is unusual in that it extends the set of tables (constraints) used to tabulate 
the data. To fill in the entries of a new table, the raw data must still be available in step 
(5). So, steps 1-4 may be considered to be "planning" steps. 

With the exception of step 4, we have completely described how to perform 
each of the steps. We now discuss how to find a good set of hyperedges 
to add to make a hypergraph acyclic. Finding the optimal set of hyperedges 
to add is very similar to the minimum fill-in problem encountered 
when performing Gaussian elimination on sparse symmetric matrices 
[28,29]. Since the minimum fill-in problem has been proven to be NP-complete 
[38], we conjecture that our problem is as well. Many heuristics have been 
studied for the minimum fill-in problem. We plan to use one of these 
heuristics, the minimum-degree heuristic, to find a good set of hyperedges 
to add. Here is our proposal for performing step 4 of our algorithm. We 
define the degree of a vertex $v$ to be the number of vertices in a common 
hyperedge with $v$. We begin by calculating the degree of each vertex in $H'$. 
Let $v$ be the vertex with the smallest degree (break ties at random). Add to 
$X$ the hyperedge $e$ which contains $v$ and any vertices in a common hyperedge 
with $v$. Next, modify $H'$ by adding $e$ to it and performing Graham's 
algorithm. If $H'$ is now the empty hypergraph then we are done; otherwise 
return to the step of calculating the degree of each vertex. The reason for 
choosing this algorithm is that it usually keeps the hyperedges in $X$ as 
small as possible. 
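
The proposal for step 4 can be sketched as follows. This is only an illustration under stated assumptions: the names `graham_reduce` and `min_degree_fill` are hypothetical, hyperedges are encoded as strings of single-letter variables, and ties are broken alphabetically rather than at random so that the result is reproducible.

```python
def graham_reduce(hyperedges):
    """Graham's reduction: returns the hyperedges left when no step applies."""
    edges = {frozenset(e) for e in hyperedges}
    while True:
        count = {}
        for e in edges:
            for v in e:
                count[v] = count.get(v, 0) + 1
        # Step 1: drop vertices found in only one hyperedge.
        trimmed = {frozenset(v for v in e if count[v] > 1) for e in edges}
        # Step 2: drop empty hyperedges and those subsumed by another.
        kept = {e for e in trimmed if e and not any(e < f for f in trimmed)}
        if kept == edges:
            return edges
        edges = kept


def min_degree_fill(hyperedges):
    """Return the list X of hyperedges added by the minimum-degree heuristic."""
    residual = graham_reduce(hyperedges)  # the hypergraph H' of step 3
    added = []
    while residual:
        # Neighbours of v: all other vertices sharing a hyperedge with v.
        nbrs = {}
        for e in residual:
            for v in e:
                nbrs.setdefault(v, set()).update(e - {v})
        # Smallest degree first; ties are broken alphabetically here
        # (an assumption; the text breaks ties at random).
        v = min(sorted(nbrs), key=lambda u: len(nbrs[u]))
        new_edge = frozenset(nbrs[v] | {v})
        added.append(new_edge)
        residual = graham_reduce(residual | {new_edge})
    return added
```

Each pass removes at least the chosen vertex from the residual hypergraph, so the loop terminates. On the hyperedges AB, AC, BCE, BDF and CD the sketch adds ABC and then BCD.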

3.2.2 Example 

In this section, we demonstrate our algorithm on the example of figure 3.3. 
First we must perform Graham's algorithm on $H$. The result is shown below, 
where elements of $Y_i$ are underlined. 

{AB, AC, BCE, BDF, CD} 

{AB, AC, BC, BD, CD} 

Figure 3.3: The original hypergraph $H$. 

Figure 3.4: The hypergraph $H'$ obtained by performing Graham's algorithm on 
the hypergraph $H$ of figure 3.3. 

So after performing step 3 of our algorithm we have the hypergraph $H'$ 
shown in figure 3.4. Now we must find the set $X$ of hyperedges to make 
$H'$ acyclic. We start by calculating the degree of the vertices in $H'$. We 
find that A and D both have a degree of two, and B and C both have a 
degree of three. Since A (we could have picked D) has the smallest degree 
we let $X$ = {{ABC}}. After adding {ABC} to $H'$ and performing 
Graham's algorithm, we find that $H'$ = ({BCD}, {{BC}, {BD}, {CD}}). 
Since $H'$ is a complete graph the only way to make it acyclic is to add a 
hyperedge which contains all vertices of $H'$. Thus, $X$ = {{ABC}, {BCD}}. 
Finally after adding the hyperedges in $X$ to those in $H$ and removing 
any hyperedges subsumed by another, we obtain the acyclic hypergraph 
({ABCDEF}, {{ABC}, {BCD}, {BCE}, {BDF}}). Therefore, the set of 
constraints for which the data must be collected is $E_1$ = {ABC}, $E_2$ = 
{BCD}, $E_3$ = {BCE}, and $E_4$ = {BDF}. Now in order to apply equation 
(3.1) we must perform Graham's algorithm on this new set of constraints. 

{ABC, BCD, BCE, BDF} 

{BCD} 

Now applying equation (3.1) we get that the maximum entropy distribution 
is as follows: 

$$P^*(V) = \frac{P(ABC)\,P(BCE)\,P(BDF)}{P(BC)\,P(BC)\,P(BD)} \, P(BCD) = \frac{P(ABC)\,P(BCE)\,P(BDF)\,P(BCD)}{P(BC)\,P(BC)\,P(BD)} \qquad (3.8)$$
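
The bookkeeping behind this formula can be reproduced mechanically by applying Theorem 1 one reduction level at a time. The sketch below is an illustration only; `closed_form_factors` is a hypothetical name, hyperedges are strings of single-letter variables, and the function returns which marginal tables appear in the numerator and denominator rather than any numerical probabilities.

```python
from collections import Counter


def closed_form_factors(hyperedges):
    """Collect numerator and denominator tables of the closed-form product.

    Returns two Counters keyed by frozenset of variables: how often each
    marginal P(.) appears in the numerator and in the denominator after
    cancelling common factors. Raises ValueError for a cyclic hypergraph.
    """
    edges = {frozenset(e) for e in hyperedges}
    numer, denom = Counter(), Counter()
    while edges:
        count = Counter(v for e in edges for v in e)
        shared = []
        for e in edges:
            # Y_i: the variables of this hyperedge shared with another one.
            y = frozenset(v for v in e if count[v] > 1)
            numer[e] += 1
            if y:  # P of the empty set is 1 and contributes nothing
                denom[y] += 1
            shared.append(y)
        # Next level: the sets Y_i with subsumed (and empty) ones removed.
        nxt = {y for y in shared if y and not any(y < z for z in shared)}
        if nxt == edges:
            raise ValueError("hypergraph is cyclic; no closed form")
        edges = nxt
    common = numer & denom  # factors appearing on both sides cancel
    return numer - common, denom - common
```

For the constraints ABC, BCD, BCE and BDF this yields the numerator tables ABC, BCE, BDF, BCD and the denominator tables BC (twice) and BD, in agreement with the formula above.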

3.2.3 Possible Problems 

In this section, we consider possible inefficiencies of our method. First, it 
may be necessary to add "large" hyperedges containing many vertices in 
order to make the hypergraph acyclic. For example, to make the complete 
undirected graph (containing all hyperedges of size two) acyclic, one must 
add the "maximum" hyperedge containing all vertices. Since the size of the 
table corresponding to a hyperedge is an exponential function of the size 
of the hyperedge, adding large hyperedges creates a problem. Furthermore, 
the table corresponding to the maximum hyperedge is itself the probability 
distribution that we are estimating, so the above situation is clearly unde- 
sirable. This kind of behavior depends on the structure of the hypergraph; 
hypergraphs which are "highly connected" will tend to require the addition 
of large hyperedges. However, when the graph is highly connected other 
techniques seem to "blow up" as well. 

Finally, because of our method's unique approach, we have a unique 
concern. Recall that, since the data is tabulated after adding the additional 
constraints, steps 1-4 of our algorithm must be performed while the source 
of the constraints (i.e., the raw data) is still available. If the added 
hyperedges are too large, there may not be enough data to calculate meaningful 
statistics. Tabulating 100,000 data points in a table of size approximately 
1,000 will give reasonable estimates, while tabulating them in a table of size 
approximately 1,000,000 will not. 
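
The arithmetic behind this concern is simple enough to check directly. In the sketch below the helper names are hypothetical, and the attributes are assumed to be binary-valued, so a hyperedge with $n$ variables needs a table of $2^n$ cells.

```python
def table_size(n_binary_attributes):
    """Number of cells in the joint table of binary-valued attributes."""
    return 2 ** n_binary_attributes


def mean_count_per_cell(n_points, n_cells):
    """Average number of observations available to estimate each cell."""
    return n_points / n_cells


# A 10-variable hyperedge needs about a thousand cells, a 20-variable
# hyperedge about a million.
small, large = table_size(10), table_size(20)
per_cell_small = mean_count_per_cell(100_000, 1_000)      # 100 points per cell
per_cell_large = mean_count_per_cell(100_000, 1_000_000)  # 0.1 points per cell
```

With about 100 observations per cell the relative frequencies are usable estimates; with one observation for every ten cells most cells are empty.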

3.3 Decomposable Models 

In this section, we introduce the model discussed by Darroch, Lauritzen and 
Speed [11]. They prove that if the model corresponding to the constraints is 
a "decomposable model" then a non-iterative algorithm exists to calculate 
the maximum entropy distribution. We begin with some definitions. Given a 
set of joint marginal constraints $E_1, \ldots, E_m$ one can construct an undirected 
graph $G = (V, E)$ by having one vertex per variable, and putting an edge 
between any two vertices where the variables corresponding to the vertices 
are in the same constraint set. For example the graph 

G = ({ABC}, {{AB}, {BC}, {AC}}) 

corresponds to the constraint set $E_1$ = {A, B, C}. A clique is a set of vertices 
where each vertex in the set is connected to all other vertices in the set. A 
maximal clique is a clique which cannot be extended by the addition of more 
vertices. The generating class for $G$ is $C = \{A_1, \ldots, A_k\}$ where $A_1, \ldots, A_k$ 
are the maximal cliques of the graph $G$. A graph $G$ is said to be a graphical 
model if every clique of $G$ is contained in some $E_i$ for $i = 1, \ldots, m$. (In 
other words, the members of the generating class for a graphical model are 
equivalent to the constraints which form the model.) For example, for the 
constraints {ABE}, {ADE} and {ACD} (see figure 3.5), the generating 
class is {ABE, ADE, ACD}. Therefore, this example is a graphical model. 
However, for the constraints {AB}, {BC}, and {AC} (see figure 3.6) the 
generating class is {ABC}. This example is not a graphical model since 
{ABC} is not a subset of {AB}, {BC}, or {AC}. Recall that a graph $G$ is 
triangulated if for every cycle of length greater than three, there is an edge of 
$G$ joining two non-consecutive vertices in the cycle. Finally, a decomposable 
model is a graphical model which is triangulated. 

Figure 3.5: Example of a graphical model: ABE, ADE, ACD. 

Figure 3.6: Example of a non-graphical model: AB, BC, AC. 
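
These definitions translate directly into two small checks. The sketch below is illustrative only: the function names are hypothetical, constraints are strings of single-letter variables, the clique test enumerates every vertex subset (so it is suitable only for small examples), and triangulation is tested by repeatedly deleting a vertex whose neighbours are pairwise adjacent, which is a standard characterisation of chordal graphs rather than a procedure taken from this text.

```python
from itertools import combinations


def interaction_graph(constraints):
    """Adjacency sets: two variables are adjacent if they share a constraint."""
    adj = {}
    for c in constraints:
        for v in c:
            adj.setdefault(v, set()).update(set(c) - {v})
    return adj


def is_graphical(constraints):
    """True if every clique of the graph is contained in some constraint."""
    adj = interaction_graph(constraints)
    verts = sorted(adj)
    for k in range(1, len(verts) + 1):
        for s in combinations(verts, k):
            is_clique = all(b in adj[a] for a, b in combinations(s, 2))
            if is_clique and not any(set(s) <= set(c) for c in constraints):
                return False
    return True


def is_triangulated(constraints):
    """True if the graph is chordal (has a perfect elimination ordering)."""
    adj = {v: set(n) for v, n in interaction_graph(constraints).items()}
    while adj:
        # A simplicial vertex is one whose neighbours are pairwise adjacent.
        simplicial = next(
            (v for v in adj
             if all(b in adj[a] for a, b in combinations(adj[v], 2))),
            None,
        )
        if simplicial is None:
            return False
        for n in adj.pop(simplicial):
            adj[n].discard(simplicial)
    return True


def is_decomposable(constraints):
    """A decomposable model is a graphical model which is triangulated."""
    return is_graphical(constraints) and is_triangulated(constraints)
```

Under these assumptions the constraints ABE, ADE, ACD pass both checks, while AB, BC, AC fail the graphical test because the clique ABC is contained in no constraint.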

Before looking at the formula which Darroch, Lauritzen and Speed discuss 
for the maximum entropy distribution of a decomposable model, we 
need some more definitions and results. If a model is decomposable its 
joint distribution can be expressed in terms of an ordering of the constraints 
$E_1, \ldots, E_m$. Since we are dealing with decomposable models, this is 
equivalent to expressing the joint distribution in terms of an ordering of the 
maximal cliques $C = \{A_1, \ldots, A_k\}$. We shall denote $V_t = E_1 \cup \cdots \cup E_t$. For 
example, $V_m = V$. There exist orderings such that $E_t \cap V_{t-1} = c_t \subseteq E_{r_t}$ for 
some $r_t \in \{1, \ldots, t-1\}$ where $2 \le t \le m$. Any ordering which obeys the 
above property is said to have the running intersection property. This leads 
to the following formula for the maximum entropy distribution: 

$$P^*(V_t) = P(E_1) \prod_{r=2}^{t} \frac{P(E_r)}{P(c_r)}, \qquad t = 2, \ldots, m \qquad (3.9)$$

3.4 Spiegelhalter's Approach 

In this section, we describe a method independently introduced by Spiegel- 
halter [31,32] for calculating the maximum entropy distribution. He takes 
advantage of the fact that the joint distribution for decomposable models 
may be expressed as a simple function of the joint probabilities on the cliques 
and the clique intersections. The essence of his technique is to create a de- 
composable model from a causal graph which is the same as the Bayesian 
network of Kim and Pearl discussed in section 2.2. 

We begin with Spiegelhalter's notation. Given a causal graph $G_d = 
(V, E)$ and $v \in V$, let $D_v$ denote the set of direct descendants of $v$, and 
$A_v$ denote the set of nodes not reachable from $v$. Spiegelhalter makes the 
following assumptions: 

1. The causal graph has no directed cycles, 

2. for all $v \in V$: $P(v \mid A_v) = P(v \mid D_v)$, 

3. for all $v \in V$: $P(v \mid D_v) > 0$, and 

4. for all $v \in V$: $P(v \mid D_v)$ is known precisely. 

The goal here is to transform $G_d$ to $G_u$ so that $G_u$ is decomposable. 
By accomplishing this goal, a non-iterative algorithm for calculating the 
maximum entropy distribution is obtained by applying equation (3.9) to 
$G_u$. Spiegelhalter uses the following result in transforming a directed graph 
to an undirected graph. 

Theorem 2 Suppose a causal graph $G_d$ has no triples of nodes $x, y, z$ such 
that $x \to z$, $y \to z$ but neither $x \to y$ nor $y \to x$. Then $G_u$ is triangulated, 
and the joint distribution is decomposable. 

Proof: Theorem 2 in Spiegelhalter's paper [31]. ■ 

We now examine Spiegelhalter's technique for estimating the probability 
distribution. 

1. He assumes the input is a directed graph $G_d$ and all conditional 
probabilities of the form $P(v \mid D_v)$ for all $v$. 

2. He changes the directed graph $G_d$ to $G_d^*$ where $G_d^*$ does not have any 
two unconnected nodes with a common child. (This step is required 
so theorem 2 will apply.) He does this by selectively adding an edge 
between a pair of unconnected nodes with a common child. To perform 
this step, first he labels any node with no descendants as $n$, then 
he labels in inverse order the next node by the rule: among those 
which have all their children labeled, choose the one which has a child 
with the lowest label, breaking ties at random. Then he uses the 
algorithm given by Tarjan and Yannakakis [34] to provide a "fill-in" 
for the ordering of the nodes given by these labels, that will make the 
graph triangulated. 

3. He transforms the directed graph $G_d^*$ to the corresponding undirected 
graph $G_u$ by removing the directions on the links. 

4. As the model he chooses the set of maximal cliques, $C = \{E_1, \ldots, E_k\}$. 
Although for arbitrary graphs the problem of finding the maximal 
cliques is NP-complete, there are efficient algorithms for finding the 
maximal cliques of triangulated graphs. (See page 268 of Rose, Tarjan 
and Lueker's paper [29].) 

5. Finally, he applies equation (3.9) to $C$ to obtain a formula for the 
estimated probability distribution. 

Figure 3.7: The original causal graph $G_d$. 

Figure 3.8: The undirected graph, $G_u$, corresponding to $G_d$. From the ordering of 
the vertices, which is shown in parentheses, we can find an ordering of the cliques 
{ABC, BCD, CE} which obeys the running intersection property. The ordering is 
$E_1$ = {CE}, $E_2$ = {ABC}, $E_3$ = {BCD}. 
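
The edge-adding part of step 2 and the removal of directions in step 3 can be sketched as below. This is an illustration only and does not reproduce Spiegelhalter's labelling or the fill-in algorithm of Tarjan and Yannakakis. The function name `moralize` is hypothetical, and the edge list used in the usage note is an assumption: the drawing of figure 3.7 is not reproduced here, so the edges were chosen merely to be consistent with the cliques ABC, BCD and CE quoted in the text.

```python
from itertools import combinations


def moralize(directed_edges):
    """Join unconnected nodes that share a child, then drop edge directions.

    `directed_edges` is a list of (parent, child) pairs. Returns the
    undirected adjacency sets and the list of edges that had to be added.
    """
    parents, adj = {}, {}
    for u, v in directed_edges:
        parents.setdefault(v, set()).add(u)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    added = []
    for child, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if b not in adj[a]:  # two unconnected nodes with a common child
                adj[a].add(b)
                adj[b].add(a)
                added.append((a, b))
    return adj, added


# Assumed edge list (hypothetical; see the note above).
assumed_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E")]
adj, added = moralize(assumed_edges)
```

With this assumed input the only edge added is the one between B and C, the two unconnected parents of D.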

We shall work through his technique on an example. Assume we begin 
with the causal graph $G_d$ which is shown in figure 3.7. In this graph, node 
D has parents B and C where B and C are not connected. Thus, an edge 
must be added between them. By theorem 2, the corresponding undirected 
graph, $G_u$, is triangulated (see figure 3.8). Since the maximal cliques in 
$G_u$ are $C$ = {ABC, BCD, CE}, the joint marginal distributions P(ABC), 
P(BCD), and P(CE) must be calculated and stored. (Recall that $P(E_i)$ 
represents a table of $2^{|E_i|}$ joint probabilities.) In order to apply equation 
(3.9), we must find an ordering of the cliques which obeys the running 
intersection property. To find such an ordering, first assign any vertex (we have 
arbitrarily chosen E) the number 1. Then perform the maximum cardinality 
search of Tarjan and Yannakakis [34] to order the remaining vertices 
(see figure 3.8). Finally order the cliques according to the largest number 
assigned to any vertex in the clique. By ordering the cliques in this manner, 
they will have the running intersection property. Once the cliques are 
ordered we can construct the following table: 



  r    E_r    V_r      c_r 
  1    CE     CE       - 
  2    ABC    ABCE     C 
  3    BCD    ABCDE    BC 



Applying equation (3.9) gives the following formula for the maximum entropy 
distribution: 

$$P^*(V) = P(CE) \, \frac{P(ABC)}{P(C)} \, \frac{P(BCD)}{P(BC)} \qquad (3.10)$$
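
The table of $E_r$, $V_r$ and $c_r$ and the check of the running intersection property can be reproduced with a few lines of Python. This is a sketch under stated assumptions: `running_intersection_table` is a hypothetical name and cliques are written as strings of single-letter variables.

```python
def running_intersection_table(ordered_cliques):
    """Build rows (r, E_r, V_r, c_r) for an ordering of the cliques.

    Raises ValueError if some c_r is not contained in an earlier clique,
    i.e. if the ordering lacks the running intersection property.
    """
    rows, seen = [], set()
    for r, clique in enumerate(ordered_cliques, start=1):
        e = set(clique)
        c = e & seen  # c_r = E_r intersected with V_{r-1}
        earlier = ordered_cliques[: r - 1]
        if r > 1 and not any(c <= set(p) for p in earlier):
            raise ValueError(f"ordering violates the property at r = {r}")
        seen |= e     # V_r = E_1 U ... U E_r
        rows.append((r, "".join(sorted(e)), "".join(sorted(seen)),
                     "".join(sorted(c))))
    return rows


rows = running_intersection_table(["CE", "ABC", "BCD"])
```

Each factor of equation (3.9) can then be read off a row: the clique table $P(E_r)$ divided by the separator table $P(c_r)$.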

A major difference between his work and our work is that he assumes 
that the data is given as conditional probability distributions rather than 
joint marginal distributions. From these conditional probabilities he finds 
a set of joint marginal distributions which form a decomposable model. To 
obtain the needed data, he uses equation (2.5) to calculate the joint 
probability distributions. Since equation (2.5) only gives the maximum entropy 
distributions for networks which are singly-connected, in general 
Spiegelhalter's technique does not produce the maximum entropy distributions. 



Chapter 4 

Comparisons 



In this chapter, we compare our new technique for calculating the maxi- 
mum entropy probability distribution to Cheeseman's summation technique, 
Spiegelhalter's algorithm, and Kim and Pearl's algorithm. In section 4.1, we 
show that our algorithm is at least as efficient as the iterative algorithm with 
Cheeseman's summation technique applied to it. In section 4.2, we compare 
our algorithm and Spiegelhalter's algorithm. And finally in section 4.3, we 
compare our work to Kim and Pearl's work on singly-connected networks. 

4.1 Cheeseman's Summation Technique 

We begin by comparing our algorithm to Cheeseman's algorithm. We prove 
that for a given problem, the hyperedges (tables) added by our technique 
are like the intermediate tables used by Cheeseman's summation technique. 
The only difference between these tables is that for Cheeseman's technique 
they are half the size, since they are summed over one of the variables in 
the table. 

As we mentioned in section 2.1, the order chosen for eliminating the 
variables in Cheeseman's summation technique greatly affects the success 
of this technique. We want to determine the minimum set of intermediate 
tables that must be created for a given ordering of the variables. When 
evaluating a sum one can visualize an imaginary "scan line" moving from 
left to right across the hypergraph, where all variables to the left of the scan 
line have already been summed over. According to the position of a scan 
line the vertices of the graph may be divided into three parts ("eliminated", 
"boundary", and "unseen") as follows: 

1. $V_E = \{v \in V \mid v \text{ is left of the scan line}\}$ 

2. $V_B = \{v \in V - V_E \mid \exists v' \in V_E : (v, v') \in E\}$ 

3. $V_U = V - (V_E \cup V_B)$ 

Let $G_1, G_2, \ldots, G_l$ be the connected components of the subgraph induced 
by $V_E \cup V_B$. And let $\gamma(G_i)$ denote the set of vertices of $V_E$ in $G_i$, and $\Gamma(G_i)$ 
denote the set of vertices in $V_B$ adjacent to some vertex in $G_i$ (i.e. $\Gamma(G_i)$ 
is the "neighborhood" of $G_i$). Finally, $\Gamma'(G_i)$ denotes the subset of $\Gamma(G_i)$ 
consisting of the vertices adjacent to some vertex in $V_U$ or $V_B$. Next, we 
partition the edges of the graph as follows: 

1. $E_L = \{(v_i, v_j) \mid v_i \in V_E \text{ and } v_j \in V_E\}$ 

2. $E_M = \{(v_i, v_j) \mid v_i \in V_E \text{ and } v_j \in V_B\}$ 

3. $E_R = \{(v_i, v_j) \mid v_i \in V_B \cup V_U \text{ and } v_j \in V_B \cup V_U\}$ 

Theorem 3 Given an ordering of the vertices in $G$, the intermediate tables 
required by Cheeseman's summation technique are a collection of subsets of 
the vertices $X_1, X_2, \ldots, X_l$ such that 

$$(\forall i)(\exists j) \mid \Gamma'(G_i) \subseteq X_j. \qquad (4.1)$$

Proof: Initially we have the summation 

$$\sum_{V} \alpha_1 \alpha_2 \cdots \alpha_m. \qquad (4.2)$$

Since the edges of the graph are partitioned into $E_R$, $E_M$, and $E_L$, and edges 
in $E_R$ do not contain any vertices in $V_E$, equation (4.2) can be rewritten as 

$$\sum_{V_U \cup V_B} \{\alpha_{E_R}\} \sum_{V_E} \{\alpha_{E_L}\}\{\alpha_{E_M}\} \qquad (4.3)$$

where $\{\alpha_{E_R}\}$ represents the product of all the $\alpha$'s corresponding to edges in 
$E_R$, and likewise for $\{\alpha_{E_M}\}$ and $\{\alpha_{E_L}\}$. Next, we partition the vertices in 
$V_E$ according to the connected components $G_1, \ldots, G_l$. The subgraphs are 
partitioned so that each vertex in $V_B$ must only be connected to elements 
in $V_E$ which are in the same connected component. Thus, (4.3) becomes 

$$\sum_{V_U \cup V_B} \{\alpha_{E_R}\} \sum_{\gamma(G_1)} \{\alpha_{E_L}\}\{\alpha_{E_M}\} \cdots \sum_{\gamma(G_l)} \{\alpha_{E_L}\}\{\alpha_{E_M}\}. \qquad (4.4)$$

Any vertex in $\Gamma(G_i) - \Gamma'(G_i)$ is not an endpoint of any edge in $E_R$. So, (4.4) can be 
written as: 

$$\sum_{V_U \cup V_{B'}} \{\alpha_{E_R}\} \sum_{\gamma'(G_1)} \{\alpha_{E_L}\}\{\alpha_{E_M}\} \cdots \sum_{\gamma'(G_l)} \{\alpha_{E_L}\}\{\alpha_{E_M}\} \qquad (4.5)$$

where $V_{B'} = \Gamma'(G_1) \cup \cdots \cup \Gamma'(G_l)$ and $\gamma'(G_i) = \gamma(G_i) \cup \Gamma(G_i) - \Gamma'(G_i)$. 

We want to replace the summations over the partitions $\gamma'(G_1), \ldots, \gamma'(G_l)$ 
by a table containing all values needed for $\sum_{V_U \cup V_{B'}}$. The table need not 
include the $\alpha$'s for the edges in $E_L$ since both vertices are in $\gamma(G_i)$. Similarly, 
the table need not include the $\alpha$'s for edges in $E_M$ whose endpoint $v \in V_B$ lies in 
$\Gamma(G_i) - \Gamma'(G_i)$ since both vertices are in $\gamma'(G_i)$. However, since the $\alpha$'s for edges in 
$E_M$ whose endpoint $v \in V_B$ lies in $\Gamma'(G_i)$ will not be eliminated when evaluating $\sum_{\gamma'(G_i)}$, 
they must be included in the tables. So clearly $X_1, \ldots, X_l$ is sufficient for 
Cheeseman's algorithm if $(\forall i)(\exists j) \mid \Gamma'(G_i) \subseteq X_j$. That is, for each partition 
a table will be needed covering all possible values of $\Gamma'(G_i)$. 

Finally, we will prove that any set $X_1, \ldots, X_l$ usable in Cheeseman's 
algorithm satisfies (4.1). Assume otherwise. That is $(\exists i)(\forall j) \mid \Gamma'(G_i) \not\subseteq 
X_j$. Let $\Gamma'(G^*)$ be the component which is not contained in any table. 
In equation (4.5) the $\sum_{V_U \cup V_{B'}}$ goes over all possible values of $\Gamma'(G^*)$. By 
definition of $\Gamma'(G^*)$ there must be at least one edge in $E_R$ where an element 
of $\Gamma'(G^*)$ is an endpoint. Therefore, a partial value from $\sum_{\gamma'(G^*)}$ will be 
needed for $\sum_{V_U \cup V_{B'}}$. This gives us the desired contradiction. ■ 

Our algorithm requires building the additional tables which correspond 
to the constraints added to make the given hypergraph acyclic. Just as we 
did above, we would like to know what additional tables are required given 
an ordering of the vertices. We will use the following notation. 

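• Let $v_1, \ldots, v_n$ be an ordering of the vertices. 

• Let $H_i = \begin{cases} H & \text{if } i = 0 \\ H_{i-1} + \phi(v_i) - \{e \mid v_i \in e\} & \text{otherwise} \end{cases}$ 

• Let $\phi(v_i) = \{v \mid v \text{ is adjacent to } v_i \text{ in } H_{i-1}\}$. 

• Let $\Phi(v_i) = \begin{cases} \emptyset & \text{if } \phi(v_i) = \emptyset \\ \phi(v_i) \cup \{v_i\} & \text{otherwise} \end{cases}$ 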



Theorem 4 Given an ordering of the vertices of a hypergraph $H$, the set 
of all non-empty $\Phi(v_i)$ when added to $H$ will make $H$ acyclic. 

Proof: We begin by showing that in order to eliminate $v_i$ it is necessary to 
add the hyperedge $e'$ containing $v_i$ and all vertices adjacent to $v_i$. That is, 
we must add $e' = \{v_i\} \cup \phi(v_i) = \Phi(v_i)$ to eliminate $v_i$. By adding $e'$, all edges 
in $H$ containing $v_i$ will be subsumed by $e'$, and eliminated by reduction 
step 2 of Graham's algorithm. Since $v_i$ only appears in $e'$, it is removed by 
reduction step 1, leaving the hyperedge $e'' = \phi(v_i)$. So, when going from 
$H_{i-1}$ (the graph after eliminating $v_{i-1}$) to $H_i$ (the graph after eliminating $v_i$), 
the edge set $\phi(v_i)$ must be added and $\{e \mid v_i \in e\}$ must be deleted. 

Now we will show that $e'$ must be added in order to eliminate $v_i$. If $e'$ 
(or any edge containing $e'$) was not added then at least one of the edges 
of $H$ containing $v_i$ will remain after applying reduction step 1. This means 
when applying reduction step 2 both this edge and $e'$ will contain $v_i$, so $v_i$ 
cannot be eliminated. ■ 

Note that sometimes a vertex may be eliminated during a step in which 
the goal is to eliminate a different vertex. So if $\phi(v_i)$ is empty no additional 
hyperedge is needed to eliminate $v_i$. 
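
The construction in Theorem 4 can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the names are hypothetical, hyperedges are strings of single-letter variables, and Graham's reduction is run after each addition so that vertices eliminated "for free" are skipped. The input in the usage note is the set of eight pairwise constraints appearing in the example of figure 4.1.

```python
def graham_reduce(hyperedges):
    """Graham's reduction: returns the hyperedges left when no step applies."""
    edges = {frozenset(e) for e in hyperedges}
    while True:
        count = {}
        for e in edges:
            for v in e:
                count[v] = count.get(v, 0) + 1
        # Step 1: drop vertices found in only one hyperedge.
        trimmed = {frozenset(v for v in e if count[v] > 1) for e in edges}
        # Step 2: drop empty hyperedges and those subsumed by another.
        kept = {e for e in trimmed if e and not any(e < f for f in trimmed)}
        if kept == edges:
            return edges
        edges = kept


def elimination_hyperedges(hyperedges, order):
    """Return the hyperedges added while eliminating vertices in `order`."""
    current = {frozenset(e) for e in hyperedges}
    added = []
    for v in order:
        touching = [e for e in current if v in e]
        if not touching:
            continue  # v was already eliminated during an earlier step
        # The new hyperedge holds v together with every vertex adjacent to v.
        new_edge = frozenset().union(*touching)
        if new_edge not in current:
            added.append(new_edge)
        current = graham_reduce(current | {new_edge})
    return added


pairs = ["AB", "AC", "BC", "CF", "CE", "EF", "BD", "DF"]
added = elimination_hyperedges(pairs, "DEFCAB")
```

For the ordering DEFCAB the sketch adds BDF, CEF, BCF and ABC; the vertices A and B need no hyperedge of their own because they disappear once ABC has been added.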

Now we are ready to prove that the additional tables required by our 
algorithm are equivalent to the intermediate tables used in Cheeseman's 
technique except that Cheeseman's tables each contain one less variable. 

Theorem 5 Given the same graph (hypergraph) and ordering of the vertices 
in the graph, the intermediate tables $X_1, \ldots, X_l$ used by Cheeseman's 
summation technique are equivalent to the non-empty $\{\phi(v_i)\}$ where the 
corresponding set $\{\Phi(v_i)\}$ are the hyperedges added by our technique. 

Proof: Let $v_i$ be the vertex which is being eliminated next in our problem. 
Then $V_E$ in Cheeseman's summation technique corresponds to 
$v_1, \ldots, v_{i-1}$ (i.e. the vertices which have already been eliminated). When 
$v_i$ is eliminated a hyperedge, $e'$, containing those vertices in $\phi(v_i)$ is added. 
Thus, the vertices in $\phi(v_i)$ form a clique after $v_i$ is eliminated. Let the 
vertices be divided into equivalence classes where any two vertices in the same 
edge (i.e. adjacent vertices) are in the same equivalence class. Clearly, when 
$v_i$ is eliminated all vertices in $\phi(v_i)$ or adjacent to some vertex in $\phi(v_i)$ are 
in the same equivalence class. These equivalence classes are identical to the 
equivalence classes given by $\Gamma(G_1), \ldots, \Gamma(G_l)$ for Cheeseman's summation 
technique. Finally any vertex in $\Gamma(G_i) - \Gamma'(G_i)$ corresponds to a vertex 
which was only connected to vertices in $V_E$ (which have been eliminated). 
That is, the vertices in $\Gamma(G_i) - \Gamma'(G_i)$ are eliminated "for free", so they 
correspond to $\phi(v_i)$ which are empty. Therefore all non-empty $\phi(v_i)$ are 
equivalent to $X_1, \ldots, X_l$ where $(\forall i)(\exists j) \mid \Gamma'(G_i) \subseteq X_j$. ■ 

To better understand theorem 5, we shall look at the example shown 
in figure 4.1. To explain this example we introduce new notation, where 
the variable sets for the intermediate tables are put in parentheses above 
the summation sign. Consider the vertex ordering DEFCAB; given such a 
vertex ordering, theorem 3 specifies the temporary tables needed by 
Cheeseman's summation technique. 

Figure 4.1: Example used to compare our technique to Cheeseman's technique. 

$$P^*(\emptyset) = \sum_{AB} \alpha_{AB} \sum_{C}^{(AB)} \alpha_{AC}\alpha_{BC} \sum_{F}^{(BC)} \alpha_{CF} \sum_{E}^{(BF,CF)} \alpha_{CE}\alpha_{EF} \sum_{D}^{(BF)} \alpha_{BD}\alpha_{DF}$$

(This summation is only being used for explanatory purposes. Since the 
elements of the probability distribution sum to 1, we know that $P^*(\emptyset) = 1$.) 
When evaluating $\sum_D$ it will be necessary to keep a table with the value of 
$\alpha_{BD}\alpha_{DF}$ for every combination of B and F. When evaluating $\sum_E$, as well as 
keeping the table for B and F, another table with all combinations of C and F 
must be created. Below are the temporary tables used by Cheeseman's method. 

BF, CF, BC, AB 
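
The variable sets of these temporary tables can be derived mechanically from the summation ordering. The sketch below treats the nested sum in the usual sum-product fashion, which is an assumption about how the technique is organised rather than a transcription of Cheeseman's procedure; `intermediate_table_scopes` is a hypothetical name and each factor is written as a string of single-letter variables.

```python
def intermediate_table_scopes(factor_scopes, order):
    """Variable sets of the tables created when summing out `order` in turn."""
    factors = [frozenset(f) for f in factor_scopes]
    tables = []
    for v in order:
        touching = [f for f in factors if v in f]
        # Summing v out of the product of the factors that mention it
        # leaves a table over all their other variables.
        scope = frozenset().union(*touching) - {v}
        factors = [f for f in factors if v not in f] + [scope]
        tables.append(scope)
    return tables


alphas = ["AB", "AC", "BC", "CF", "CE", "EF", "BD", "DF"]
tables = intermediate_table_scopes(alphas, "DEFC")
```

Summing out D, E, F and C in that order produces tables over BF, CF, BC and AB, after which only the final sum over A and B remains.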

Now we consider the additional hyperedges which must be added to 
make the hypergraph acyclic. Theorem 4 defines the set of additional hy- 
peredges that make a hypergraph acyclic. First, the hyperedge BFD elimi- 
nates vertex D. Hyperedge BFD eliminates hyperedges BD and DF since 
it subsumes them. Now D is only in hyperedge BDF and so is eliminated, 
leaving hyperedge BF. Second, hyperedge CFE eliminates vertex E. Now 
vertex F is only in hyperedges BF and CF so BCF eliminates it. Finally, 
hyperedge ABC eliminates the remaining vertices. So, the following addi- 
tional hyperedges will make our example graph acyclic (in parenthesis is the 
variable eliminated by adding the edge): 

BF(D), CF(E), BC(F), AB(C) 

Ignoring the variables in parentheses, these are identical to Cheeseman's 
tables. Nevertheless, there are important differences between these methods. 

First, in terms of time complexity, Cheeseman's method specifies an 
iterative approximation of the $\alpha$'s, whereas our method requires no such 
iteration. So, if Cheeseman's method requires 10 iterations on the average, 
our method should yield an average speed-up of a factor of 10. 

Second, in terms of space complexity, both methods use approximately 
the same amount of space. However, our method adds what might be called 
"permanent" edges, since they correspond to tabulations of the raw data. 
Note, however, that new edges may subsume and eliminate original edges, 
so the space required by our method may not be quite as great as it first 
appears. In Cheeseman's method the tables exist only temporarily during 
the course of the computation, and not all such tables may be needed at the 
same time. 

And finally, in terms of the "precomputation" needed, both methods 
need to compute a vertex ordering to use. We observe that a good summa- 
tion ordering is a good ordering for eliminating vertices. So the problem of 
choosing the hyperedges to make a graph acyclic is essentially equivalent to 
the problem of choosing an optimal summation ordering. Therefore, we con- 
clude that our algorithm will be generally more efficient than Cheeseman's 
algorithm where both are applicable. 

4.2 Spiegelhalter's Algorithm 

We now compare our algorithm to Spiegelhalter's algorithm. We show 
that a graphical model is decomposable if and only if the corresponding 
hypergraph is acyclic. We then prove that if the decomposable model obtained 
by Spiegelhalter's technique is the same as the acyclic hypergraph obtained 
after step 4 of our technique, then these techniques produce the same formula 
for the estimated probability distribution. 

We begin by proving that a graphical model is decomposable exactly 
when it is acyclic. 

Theorem 6 A graphical model is decomposable if and only if the corresponding 
hypergraph is acyclic.¹ 

Proof: Given a set of attributes and constraints, construct a graph $G$ with 
a vertex for each attribute and an edge connecting the vertices corresponding 
to each pair of attributes contained in the same constraint. Likewise, 
construct a hypergraph $H$ with a vertex for each attribute and a hyperedge 
corresponding to each constraint. By definition, $G$ is decomposable if it is 
graphical and triangulated, and $H$ is acyclic if it is conformal and $C(H)$ is 
chordal. By definition $G$ is graphical exactly when $H$ is conformal. We 
also note that $G$ and $C(H)$ are the same graph, so $G$ is triangulated exactly 
when $C(H)$ is chordal. Therefore, $G$ is decomposable if and only if $H$ is acyclic. 
■ 

The essence of our algorithm is to take an arbitrary hypergraph and 
transform it into an acyclic hypergraph. Likewise, Spiegelhalter's algorithm 
takes an arbitrary directed graph and transforms it into a decomposable 
model. From theorem 6, we know that a model is decomposable if and only 
if the corresponding hypergraph is acyclic. So the first question one may ask 
is how our technique and Spiegelhalter's technique compare, given that 
they arrive at the same acyclic hypergraph. In the next theorem, we prove 
that the algorithm Spiegelhalter's technique uses once it has a decomposable 
model produces the same formula for the estimated probability distribution 
as the algorithm our technique uses once we have an acyclic hypergraph. 

¹ Malvestuto [25] pointed out that both acyclic hypergraphs and decomposable models 
have been shown to obey the running intersection property, which implies that these 
models must be equivalent. 

Theorem 7 Given an acyclic hypergraph (decomposable model) as input, 
equation (3.9) and equation (3.1) give the same formula for the estimated 
probability distribution. 

Proof: First of all since equation (3.9) applies to decomposable models and 
equation (3.1) applies to acyclic hypergraphs, by theorem 6 these equations 
apply to the same graphs. Equation (3.9) can be rewritten as 

    $P(V) = \left[\prod_i P(E_i)\right] \left[\prod_i \frac{1}{P(c_i)}\right]$    (4.6) 

and equation (3.1) can be rewritten as 

    $P(V) = \left[\prod_i P(E_i^{(0)})\right] \left[\frac{\prod_k \prod_i P(E_i^{(k+1)})}{\prod_k \prod_i P(Y_i^{(k)})}\right]$    (4.7) 

We know that 



    $\prod_i P(E_i) = \prod_i P(E_i^{(0)})$ 



since those are just the product of the supplied joint marginal probabilities. 
So all we must do is show that the second parts of equations (4.6) and (4.7) 
are equivalent. 

Order the edges in $\mathcal{E}^{(0)}$ according to the level reached in the reduction 
procedure so that the earlier an edge is eliminated the higher it is numbered. 
Now we will show that this ordering will obey the running intersection 
property. Let $V_i = E_1 \cup \cdots \cup E_i$ where $E_1$ is the first edge in the ordering, and 
$E_t$ is the last edge in the ordering. By definition 

    $Y_i^{(k)} = \left[\bigcup_{j \neq i} E_j^{(k)}\right] \cap E_i^{(k)}$    (4.8) 
We are interested in the value of $Y_i^{(k)}$ at the value for $k$ when edge $E_i$ is 
eliminated. In this case, the term $\bigcup_{j \neq i} E_j^{(k)}$ will consist of the union of what 
remains after step $k$ of all the edges ordered before $E_i$. That is, $\bigcup_{j \neq i} E_j^{(k)}$ 
contains all vertices in edges in $V_{i-1}$ except for those which have been 
eliminated because they were only in one hyperedge. Since these eliminated 
vertices could not be in $E_i^{(k)}$, it follows that for the step $k$ in which $Y_i^{(k)}$ is 
eliminated 

    $Y_i^{(k)} = V_{i-1} \cap E_i^{(k)} = V_{i-1} \cap E_i^{(0)} = V_{i-1} \cap E_i$.    (4.9) 

Furthermore, from the ordering of the hyperedges, we know that each 
hyperedge is subsumed by a predecessor in the ordering. So, $Y_i^{(k)} \subseteq E_j$ for 
some $j$ with $i > j$. Thus, from equation (4.9) it follows that 

    $E_i \cap V_{i-1} = c_i = Y_i^{(k)} \subseteq E_j$ where $i > j$    (4.10) 

which proves that the running intersection property holds. 
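Now let's look at 

    $\frac{\prod_k \prod_i P(E_i^{(k+1)})}{\prod_k \prod_i P(Y_i^{(k)})}$ 

For any $k$ in which $Y_i^{(k)}$ is not eliminated, $E_i^{(k+1)} = Y_i^{(k)}$. Therefore all 
terms will cancel except those for which $Y_i^{(k)}$ is eliminated in step $k$. We 
have argued above that in this case, $Y_i^{(k)} = E_i \cap V_{i-1} = c_i$. So we have that 

    $\frac{\prod_k \prod_i P(E_i^{(k+1)})}{\prod_k \prod_i P(Y_i^{(k)})} = \prod_i \frac{1}{P(c_i)}$ 

And thus equations (4.6) and (4.7) are the same. ■ 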

So we know that if our technique and Spiegelhalter's technique obtain 
the same acyclic hypergraph for a problem, then they will arrive at the same 
formula for the estimated probability distribution. 
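
To make the shared formula concrete, here is a small Python sketch. It is a hedged illustration under stated assumptions: the helper names (`separators`, `has_running_intersection`, `marginal`, `estimate`) and the toy joint table are hypothetical and do not come from the thesis. Given hyperedges in an order satisfying the running intersection property, it forms the separators c_i = E_i ∩ V_{i-1} and evaluates the product of the hyperedge marginals divided by the product of the separator marginals.

```python
from itertools import product


def separators(ordered_edges):
    """c_i = E_i intersected with V_{i-1}, where V_{i-1} = E_1 u ... u E_{i-1}."""
    seen, seps = set(), []
    for e in ordered_edges:
        seps.append(set(e) & seen)
        seen |= set(e)
    return seps


def has_running_intersection(ordered_edges):
    """Each separator c_i (i > 1) must lie inside some earlier hyperedge E_j."""
    seps = separators(ordered_edges)
    return all(
        any(seps[i] <= set(ordered_edges[j]) for j in range(i))
        for i in range(1, len(ordered_edges))
    )


def project(values, names, subset):
    """Restrict a full assignment (tuple) to the attributes in `subset`."""
    return tuple(values[names.index(v)] for v in subset)


def marginal(joint, names, subset):
    """Sum a full joint table (dict keyed by value tuples) down to `subset`."""
    out = {}
    for values, p in joint.items():
        key = project(values, names, subset)
        out[key] = out.get(key, 0.0) + p
    return out


def estimate(joint, names, ordered_edges):
    """Evaluate prod_i P(E_i) / prod_i P(c_i) at every full assignment."""
    edges = [sorted(e) for e in ordered_edges]
    seps = [sorted(c) for c in separators(ordered_edges)[1:]]
    edge_tables = [(e, marginal(joint, names, e)) for e in edges]
    sep_tables = [(c, marginal(joint, names, c)) for c in seps]
    est = {}
    for values in joint:
        num = 1.0
        for e, table in edge_tables:
            num *= table[project(values, names, e)]
        den = 1.0
        for c, table in sep_tables:
            den *= table[project(values, names, c)]
        est[values] = num / den
    return est


# Toy data (assumed): three binary attributes with strictly positive weights.
names = ["A", "B", "C"]
joint = {v: (i + 1) / 36 for i, v in enumerate(product([0, 1], repeat=3))}
est = estimate(joint, names, ["AB", "BC"])   # P(AB) P(BC) / P(B)
print(round(sum(est.values()), 12))
```

In this example the estimate is a proper distribution that reproduces the supplied AB and BC marginals, although it generally differs from the toy joint table used to generate them.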
We now consider the additional data required by our technique and the 
additional data required by Spiegelhalter's technique. As we have men- 
tioned, adding hyperedges to the original hypergraph corresponds to requir- 
ing joint marginal probability distributions which were not included in the 
original data. Now let's look at the data requirements for Spiegelhalter's 
algorithm. In order to get a decomposable model, Spiegelhalter must add 
edges to the original directed graph. In doing so, he may form cliques in the 
corresponding undirected graph which are larger than any of the original 
maximal cliques. These large cliques correspond to joint marginal probabil- 
ity distributions which are not contained in the original data. In fact, these 
two techniques will require the same additional data exactly when the two 
techniques obtain the same acyclic hypergraph. 

Finally, let's compare how we propose getting the additional data to 
how Spiegelhalter proposes doing so. We propose that this additional data 
is collected with the original data by having the first four steps of our al- 
gorithm be "planning" steps. On the other hand, Spiegelhalter's technique 
assumes conditional independence to get the additional data from the given 
constraints. That is, he proposes to obtain these joint marginal distribu- 
tions by applying equation (2.5) to the directed graph given as input. Now 
in the case where the directed graph is a tree (singly-connected) then the 
joint marginal distribution obtained will be the maximum entropy distri- 
bution. However, when the directed graph is not a tree, the resulting joint 
marginal distribution will not be the maximum entropy distribution. Therefore, 
even in the case where our techniques arrive at the same formula for 
the estimated probability distribution, the actual estimates for the com- 
plete probability distribution may differ since the data for the added joint 
marginal distributions may be different. 
4.3 Kim and Pearl's Work 

In this section we will compare acyclic hypergraphs to singly-connected 
Bayesian networks. Kim and Pearl's technique for propagating new evidence 
through the network depends on the network being singly-connected². A 
Bayesian network is a directed acyclic graph; such a network is said to be 
singly-connected if it has no undirected cycles. 

Given a network N, we define a corresponding hypergraph G as follows. 
The vertices of G are the nodes of N. For each node x in N we create 
a hyperedge in G, consisting of the corresponding vertex and the vertices 
corresponding to all immediate predecessors of x in N. 
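
The construction above, together with the definition of singly-connected, can be sketched in Python as follows. This is an illustration under assumptions: the network is represented as a dictionary from each node to the list of its immediate predecessors, every predecessor is itself a key of that dictionary, and the function names are hypothetical.

```python
def network_to_hypergraph(parents):
    """One hyperedge per node x: x together with its immediate predecessors."""
    return [frozenset({x, *ps}) for x, ps in parents.items()]


def is_singly_connected(parents):
    """True when the network has no undirected cycle (union-find check)."""
    root = {x: x for x in parents}

    def find(x):
        # Follow parent pointers to the representative, halving paths as we go.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for child, ps in parents.items():
        for p in ps:
            a, b = find(child), find(p)
            if a == b:          # this arc closes an undirected cycle
                return False
            root[a] = b
    return True


# Example (assumed): the chain A -> B -> C.
chain = {"A": [], "B": ["A"], "C": ["B"]}
print(network_to_hypergraph(chain))
print(is_singly_connected(chain))
```

For the chain, the hyperedges are {A}, {A, B}, and {B, C}, and the network is singly-connected.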

Theorem 8 If a Bayesian network N is singly-connected, then the corre- 
sponding hypergraph G is acyclic. 

Proof: Since N is singly-connected, it contains no undirected cycles, by 
the definition of singly-connected. Since an acyclic undirected graph is a 
tree or forest, N must contain some node s with degree at most one. The 
corresponding vertex in G is contained in exactly one hyperedge and so 
can be eliminated by reduction step 1. Finally, we must show that the 
reduced network N' so obtained corresponds to the reduced hypergraph G' 
that remains after eliminating the vertex corresponding to s. When s is a 
source in N (or a sink which is the second to last node) this correspondence 
is obtained immediately. In the remaining cases, the correspondence holds 
only after applying reduction step 2 to eliminate the hyperedge remaining 
after s is eliminated. (This hyperedge contains only the parent of s.) The 
network N' is singly-connected. By induction, every vertex in G can be 
eliminated. Thus, G is acyclic. ■ 

Theorem 9 A Bayesian network is not necessarily singly-connected if the 
corresponding hypergraph is acyclic. 

² Pearl [26] has examined ways to extend his algorithm to multiply connected networks. 
Proof: We prove this by means of an example. The following Bayesian 
network is not singly-connected since there is an undirected cycle. 

[Figure: Bayesian network on the nodes A, B, C, and D, containing an undirected cycle] 




The hypergraph corresponding to the above network is acyclic, as shown 
by the reduction below: 

    {ABC, BCD} 

    {BC} 



■ 
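
The reduction in this example can be replayed with a short Python sketch. It assumes, as the surrounding text indicates, that reduction step 1 deletes vertices contained in exactly one hyperedge and that reduction step 2 deletes hyperedges contained in another hyperedge; the function name, the string encoding of hyperedges, and the choice to record one stage per pass are assumptions of the sketch.

```python
def graham_reduce(hyperedges):
    """Apply reduction steps 1 and 2 until nothing changes.

    Returns (stages, acyclic): the hyperedge sets after each pass, and
    whether the hypergraph was reduced to nothing.
    """
    def canon(es):
        return sorted(tuple(sorted(e)) for e in es)

    def show(es):
        return sorted("".join(sorted(e)) for e in es)

    edges = [set(e) for e in hyperedges]
    stages = [show(edges)]
    while edges:
        before = canon(edges)
        # Step 1: delete every vertex that lies in exactly one hyperedge.
        counts = {}
        for e in edges:
            for v in e:
                counts[v] = counts.get(v, 0) + 1
        edges = [{v for v in e if counts[v] > 1} for e in edges]
        # Step 2: delete empty hyperedges and those contained in another one.
        kept = []
        for e in sorted(edges, key=len, reverse=True):
            if e and not any(e <= k for k in kept):
                kept.append(e)
        edges = kept
        if canon(edges) == before:   # no progress, so the hypergraph is cyclic
            break
        stages.append(show(edges))
    return stages, not edges


stages, acyclic = graham_reduce(["ABC", "BCD"])
print(stages, acyclic)
```

For the hypergraph {ABC, BCD} the recorded stages are {ABC, BCD}, then {BC}, then the empty set, so the hypergraph is acyclic. A four-cycle of pairwise hyperedges such as AB, BC, CD, AD makes no progress and is reported as cyclic.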

We know that in the case where the Bayesian network is a tree, equation 
(2.5) gives the maximum entropy distribution, and when the network is 
not a tree, equation (2.5) does not give the maximum entropy distribution. 
However, if the independence constraints assumed by Kim and Pearl are 
supplied as additional constraints to a maximum entropy algorithm, then 
the two techniques give equivalent results. This is because the conditional 
independence constraints uniquely define a probability distribution. So, in 
some sense one could argue that a maximum entropy algorithm is more 
general than the one used by Kim and Pearl. 



Chapter 5 

Conclusions and Open Problems 



We have presented an efficient algorithm for calculating the maximum en- 
tropy distribution for a given set of attributes and constraints. Using a hy- 
pergraph to model the attributes and constraints, we have shown the benefits 
of making the corresponding hypergraph acyclic. We also have shown how 
to make a hypergraph acyclic by adding hyperedges (constraints). We have 
proved that our technique is at least as efficient as Cheeseman's method. 
Furthermore, we proved that acyclic hypergraphs and decomposable models 
are equivalent. We also demonstrated that the formula for the 
maximum entropy distribution which is derived for acyclic hypergraphs by 
our technique and the formula given for decomposable models by Spiegelhal- 
ter's technique are equivalent if given the same undirected graph as input. 
The significant difference between these techniques is that we "plan ahead" 
so that the additional data required is available. Finally we have compared 
our work to Kim and Pearl's work on singly-connected networks. 

An open problem is to determine whether or not the problem of choosing 
the best set of hyperedges which will make a hypergraph acyclic is an NP- 
complete problem. We conjecture that this is the case, but have not yet been 
able to exhibit a proof. We would like either to find an NP-completeness 
proof, or to find a polynomial time algorithm to solve the problem. Given 
our conjecture that the constraint addition problem is NP-complete, we 
would like to explore heuristics for finding a good set of hyperedges to add. 

Another direction of future research is to determine how well our new 
algorithm works on real-life problems. We intend to try our technique on 
some realistic examples. Our goal is to determine if the size of the hyper- 
edges will remain within reasonable limits for such realistic examples. We 
expect that in practice our new method will give substantial improvements 
in running time. Finally, we will study the effects on accuracy of keeping 
tables which may be larger than the original tables. 

Another interesting problem is to find a condition which ensures the 
existence of a non-iterative algorithm for approximating the maximum en- 
tropy distribution when the input can consist of both joint probabilities 
and conditional probabilities, as well as some clearly defined independence 
constraints. Furthermore, since the problem of finding the maximum en- 
tropy distribution is a non-linear optimization problem, it is interesting to 
ask if the techniques explored in this thesis could be applied to the general 
problem of non-linear optimization. 